Abstract

Manifold matching works to identify embeddings of multiple disparate data 
spaces into the same low-dimensional space, where joint inference can be pur- 
sued. It is an enabling methodology for fusion and inference from multiple 
and massive disparate data sources. In this paper we focus on a method 
called Canonical Correlation Analysis (CCA) and its generalization General- 
ized Canonical Correlation Analysis (GCCA), which belong to the more general 
Reduced Rank Regression (RRR) framework. We present an efficiency investi- 
gation of CCA and GCCA under different training conditions for a particular 
text document classification task. 
1. Introduction

1.1. Purpose 

In the real world, a single object may have different representations in different domains. For example, the Declaration of Independence has versions translated into different languages. Let $n$ denote the number of objects $O_i$, $i = 1, \ldots, n$, and $K$ be the number of domains. Then we have

$$x_{i1} \sim \cdots \sim x_{ik} \sim \cdots \sim x_{iK}, \quad i = 1, \ldots, n \tag{1}$$

where the $i$th object $O_i$ has $K$ measurements $x_{ik}$, $k = 1, \ldots, K$; $x_{ik} \in \Xi_k$ is the representation of object $O_i$ in space $\Xi_k$.
Figure 1: Classification problem

The problem explored in this paper is, for $m$ new objects $O'_i$, $i = 1, \ldots, m$, how to classify their representations $y_{ik} \in \Xi_k$ given the representations $y_{i'k'} \in \Xi_{k'}$ with $k \neq k'$. For this task, the $x_{ik}$, $x_{ik'}$, $i = 1, \ldots, n$, described above are needed to learn the relation between $\Xi_k$ and $\Xi_{k'}$ so that we can map data from $\Xi_k$ and $\Xi_{k'}$ to a common space $\mathcal{X}$. Thus $x_{ik}$, $x_{ik'}$ are the domain relation learning training data. We are interested in the particular setting in which the data to be classified belong to classes disjoint from those of the data used to learn the low-dimensional manifold. This is shown in Figure 1, where disks represent the domain relation learning training data $x_{ik}$, $x_{ik'}$ and squares denote the classifier training and testing data $y_{ik}$, $y_{i'k'}$. A classification rule $g$ is trained on $y_{i'k'}$ and applied to $y_{ik}$. We consider one domain relation learning method, Canonical Correlation Analysis (CCA) [7], which can be carried out using reduced-rank regression routines [H [10]. We investigate classification performance in the common space $\mathcal{X}$ obtained via CCA, training the classifier on $y_{i'k'}$ and testing on $y_{ik}$. The focus of this paper is not on optimizing the classifier; rather, we investigate performance for a given classifier (5-Nearest Neighbor) as a function of the number $n$ of domain relation learning training observations used to learn $\mathcal{X}$. The main contribution of this paper is an investigation of supplementing the classifier's training data with data from other disparate sources/spaces.

1.2. Summary

The structure of the paper is as follows: Section 2 reviews related work. Section 3 discusses the methods employed, including the manifold matching framework as well as embedding and classification details. The experimental setup and results are presented in Section 4. Section 5 concludes.
2. Background

Different methods of transfer learning, multitask learning and domain adaptation are discussed in a recent survey [19]. There are algorithms developed for unsupervised document clustering where training and testing data are of different kinds [12]. The problem explored in this paper can be viewed as a domain adaptation problem, in which the training and testing data of the classifier are from different domains. When the classification is on text documents in different languages, as described in the later sections of this paper, it is called cross-language text classification. There is much work on inducing correspondences between different language pairs, including using bilingual dictionaries [T8], latent semantic analysis (LSA) features [5], kernel canonical correlation analysis (KCCA) jTJj, etc. Machine translation is also used in cross-language text classification to translate the documents into a single domain.

3. Method 

In this paper, we focus on manifold matching. The whole procedure can be 
divided into the following steps: 

• For each space $\Xi_k$, $k = 1, \ldots, K$, calculate the dissimilarity matrix for all domain relation learning training data observations $O_i$.

• For each $k$, use Multidimensional Scaling (MDS) on the dissimilarity matrix to get a Euclidean representation $E_k$.

• Run CCA (for $K = 2$) or Generalized CCA ($K > 2$) to map the collection $E_1, \ldots, E_K$ to a common space $\mathcal{X}$.

• Pursue joint inference (i.e. classification) in the common space $\mathcal{X}$.

This procedure combines MDS and (Generalized) CCA in a sequential way. 
Firstly MDS is applied to learn low-dimensional manifolds, then (Generalized) 
CCA is used to match those manifolds to obtain a common space. 

This paper focuses on manifold matching and demonstrates the classification improvement obtained by fusing data from an additional space to learn the common low-dimensional manifold. It would be interesting to investigate how to generate the low-dimensional space using all data jointly instead of matching separate manifolds. However, this requires properly calculating dissimilarities between the objects' representations in different spaces for the multidimensional scaling step. This issue has been investigated, e.g., in [17] [23], but there has not been any clear answer.

3.1. Manifold Matching Framework 



The framework structure for manifold matching is shown in Figure 2 [17] [23].

Figure 2: Manifold matching model

For each of the $n$ objects $O_i \in \Xi$, $i = 1, \ldots, n$, there are $K$ representations $x_{ik} \in \Xi_k$, $k = 1, \ldots, K$, generated by the mappings $\pi_k$. Manifold matching works to find $\rho_1, \ldots, \rho_K$ to map $x_{i1}, \ldots, x_{iK}$ to a low-dimensional common space $\mathcal{X} = \mathbb{R}^d$:

$$\tilde{x}_{ik} = \rho_k(x_{ik}), \quad i = 1, \ldots, n, \quad k = 1, \ldots, K. \tag{2}$$

After learning the $\rho_k$'s, we can map a new measurement $y_k \in \Xi_k$ into the common space $\mathcal{X} = \mathbb{R}^d$ via:

$$\tilde{y}_k = \rho_k(y_k) \tag{3}$$

This allows joint inference to proceed in $\mathbb{R}^d$.
3.2. Embedding 

The work described in this paper is based on dissimilarity measures. Let $\delta_k$ denote the dissimilarity measure in the $k$th space $\Xi_k$, and $\delta$ be the Euclidean distance in the common space $\mathbb{R}^d$. There are two kinds of mapping errors induced by the $\rho_k$'s: fidelity error and commensurability error.

Fidelity measures how well the original dissimilarities are preserved in the mapping $x_{ik} \mapsto \tilde{x}_{ik}$, and the fidelity error is defined as the within-condition squared error:

$$\epsilon_{f_k} = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \left( \delta(\tilde{x}_{ik}, \tilde{x}_{jk}) - \delta_k(x_{ik}, x_{jk}) \right)^2 \tag{4}$$

Commensurability measures how well the matchedness is preserved in the mapping, and the commensurability error is defined as the between-condition squared error:

$$\epsilon_{c_{k_1 k_2}} = \frac{1}{n} \sum_{1 \le i \le n} \left( \delta(\tilde{x}_{ik_1}, \tilde{x}_{ik_2}) \right)^2 \tag{5}$$
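As an illustration, both error criteria can be evaluated directly on a set of embeddings. The Python sketch below is not part of the original experiments: the function names are hypothetical, and the commensurability term assumes that matched objects have zero between-space dissimilarity, so the error reduces to the mean squared distance between matched embeddings.

```python
import math
from itertools import combinations


def euclid(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def fidelity_error(embedded, dissim):
    """Mean over pairs i < j of (embedded distance - original dissimilarity)^2."""
    pairs = list(combinations(range(len(embedded)), 2))
    total = sum((euclid(embedded[i], embedded[j]) - dissim[i][j]) ** 2
                for i, j in pairs)
    return total / len(pairs)


def commensurability_error(embedded_a, embedded_b):
    """Mean squared distance between matched embeddings from two spaces.

    Assumption: matched objects should coincide in the common space.
    """
    n = len(embedded_a)
    return sum(euclid(a, b) ** 2 for a, b in zip(embedded_a, embedded_b)) / n
```

A perfect embedding of one space gives a fidelity error of 0, and two spaces whose matched points coincide give a commensurability error of 0.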

Multidimensional Scaling (MDS) [31 [5] works to get a Euclidean representation while approximately preserving the dissimilarities. Given the $n \times n$ dissimilarity matrix $\Delta_k = [\delta_k(x_{ik}, x_{jk})]$ in space $\Xi_k$, multidimensional scaling generates embeddings $x'_{ik} \in \mathbb{R}^{d'}$ for $x_{ik} \in \Xi_k$, $i = 1, \ldots, n$, $k = 1, \ldots, K$, which attempt to optimize fidelity, that is, $\|x'_{ik} - x'_{jk}\| \approx \delta_k(x_{ik}, x_{jk})$.

For the $K = 2$ case, multidimensional scaling generates $n \times d'$ matrices $X'_1$ from $\Delta_1$ and $X'_2$ from $\Delta_2$. The $i$th row vector $x'_{ik}$ of $X'_k$ is the multidimensional scaling embedding for $x_{ik}$.
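The text does not state which MDS variant is used. As a minimal sketch, assuming classical (Torgerson) MDS and NumPy, the embedding of one dissimilarity matrix could be computed as follows; `classical_mds` is a hypothetical helper name.

```python
import numpy as np


def classical_mds(delta, dim):
    """Classical (Torgerson) MDS: embed an n x n dissimilarity matrix in R^dim.

    Assumption: classical MDS stands in for the unspecified MDS variant.
    """
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared dissimilarities to obtain a Gram-like matrix.
    gram = -0.5 * centering @ (delta ** 2) @ centering
    evals, evecs = np.linalg.eigh(gram)
    # Keep the `dim` largest eigenvalues; clip small negatives to zero.
    order = np.argsort(evals)[::-1][:dim]
    scale = np.sqrt(np.clip(evals[order], 0.0, None))
    return evecs[:, order] * scale
```

When the dissimilarities are exact Euclidean distances between points in $\mathbb{R}^{dim}$, the pairwise distances of the returned rows reproduce them up to numerical error.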

Canonical correlation analysis is applied to the multidimensional scaling results. Canonical correlation works to find $d' \times d$ matrices $U_1 : X'_1 \mapsto \tilde{X}_1$ and $U_2 : X'_2 \mapsto \tilde{X}_2$ as the linear mappings that maximize correlation for the mappings into $\mathbb{R}^d$, where the two matrices satisfy $U_1^T U_1 = I$ and $U_2^T U_2 = I$. That is, for the $\ell$th ($1 \le \ell \le d$) dimension, the mapping process is defined by $u_1^\ell$ and $u_2^\ell$, the $\ell$th column vectors of $U_1$ and $U_2$ respectively. The orthonormal requirement on the columns of $U_1$ (similarly $U_2$) implies that the correlation between different dimensions of the embedding is 0. The correlation of the mapped data is calculated as

$$\rho_\ell = \max_{u_1^\ell, u_2^\ell} \frac{(X'_1 u_1^\ell)^T (X'_2 u_2^\ell)}{\|X'_1 u_1^\ell\| \, \|X'_2 u_2^\ell\|} \tag{6}$$

which is equivalent to

$$\rho_\ell = \max_{u_1^\ell, u_2^\ell} (X'_1 u_1^\ell)^T (X'_2 u_2^\ell) \tag{7}$$

subject to

$$(X'_1 u_1^\ell)^T (X'_1 u_1^\ell) = (X'_2 u_2^\ell)^T (X'_2 u_2^\ell) = 1 \tag{8}$$

And the constraint can be proved to be equivalent to

$$\frac{1}{2} \left[ (X'_1 u_1^\ell)^T (X'_1 u_1^\ell) + (X'_2 u_2^\ell)^T (X'_2 u_2^\ell) \right] = 1 \tag{9}$$

For CCA it holds that $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_d$.

For new data $y_k$, $k = 1, 2$, out-of-sample embedding for multidimensional scaling [TJ 127) generates a $d'$-dimensional row vector $y'_k$. The final embeddings in the common space $\mathbb{R}^d$ are given by $\tilde{y}_1 = y'_1 U_1$ and $\tilde{y}_2 = y'_2 U_2$.

Canonical correlation analysis optimizes commensurability without regard 
for fidelity [23] . For our work, first we use multidimensional scaling to generate a 
fidelity-inspired Euclidean representation, and then we use canonical correlation 
analysis to enforce low dimensional commensurability. 
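To make the CCA step concrete, the following NumPy sketch computes the two projection matrices by whitening each MDS embedding and taking a singular value decomposition of the whitened cross-product. This is one standard way to solve the two-set problem and is offered as an assumption, not as the routine used in the experiments; the columns are normalized so that each projected dimension has unit norm, and `cca` and `inv_sqrt` are hypothetical names.

```python
import numpy as np


def inv_sqrt(C, eps=1e-10):
    """Symmetric inverse square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, eps, None)  # guard against (near-)zero eigenvalues
    return (V / np.sqrt(w)) @ V.T


def cca(X1, X2, d):
    """Return U1, U2 and the top-d canonical correlations for two data sets.

    Rows of X1 and X2 are matched observations (e.g. MDS embeddings).
    """
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    W1 = inv_sqrt(X1.T @ X1)
    W2 = inv_sqrt(X2.T @ X2)
    # Singular values of the whitened cross-product are the correlations.
    A, rho, Bt = np.linalg.svd(W1 @ X1.T @ X2 @ W2)
    U1 = W1 @ A[:, :d]
    U2 = W2 @ Bt.T[:, :d]
    return U1, U2, rho[:d]
```

A new (centered) MDS row vector `y1` would then be mapped into the common space as `y1 @ U1`, and the returned correlations are sorted in decreasing order.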

Canonical correlation analysis was developed as a way of measuring the correlation of two multivariate data sets, and it can be formulated as a generalized eigenvalue problem. Its extension to more than two multivariate data sets is also available [15] and is called Generalized Canonical Correlation Analysis (GCCA). Generalized canonical correlation analysis simultaneously finds $U_1 : X'_1 \mapsto \tilde{X}_1, \ldots, U_K : X'_K \mapsto \tilde{X}_K$ to map the multivariate data sets in $K$ spaces to the common space $\mathbb{R}^d$. Similarly, for the new data $y_k$, $k = 1, \ldots, K$, we can get their representations in the common space $\mathbb{R}^d$ as $\tilde{y}_1 = y'_1 U_1, \ldots, \tilde{y}_K = y'_K U_K$. Similar to CCA, the correlation of data in the $\ell$th mapping dimension is calculated as [28]

$$\rho_\ell = \frac{1}{K(K-1)} \sum_{\substack{g,h=1 \\ g \neq h}}^{K} (X'_g u_g^\ell)^T (X'_h u_h^\ell) \tag{10}$$

subject to

$$\frac{1}{K} \sum_{g=1}^{K} (X'_g u_g^\ell)^T (X'_g u_g^\ell) = 1 \tag{11}$$

GCCA can be formulated as a generalized eigenvalue problem. Different algorithms have been developed to solve it, e.g. least squares regression. Because the dataset used in our experiments is not very large, we can perform eigenvalue decomposition on the respective matrices directly.
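A direct eigendecomposition of this kind can be sketched in NumPy as follows. The sketch assumes the objective is the average pairwise correlation under an average unit-norm constraint, which leads to the generalized eigenproblem $C u = \mu D u$, with $C$ the cross-product matrix of the concatenated views and $D$ its block diagonal; the correlation is then $(\mu - 1)/(K - 1)$. The names `gcca` and `inv_sqrt` are hypothetical, and this is not claimed to be the authors' implementation.

```python
import numpy as np


def inv_sqrt(C, eps=1e-10):
    """Symmetric inverse square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, eps, None)
    return (V / np.sqrt(w)) @ V.T


def gcca(views, d):
    """Generalized CCA for K row-matched views; returns ([U_1..U_K], rho)."""
    K = len(views)
    Xc = [X - X.mean(axis=0) for X in views]
    W = [inv_sqrt(X.T @ X) for X in Xc]          # blocks of D^{-1/2}
    Z = np.hstack([X @ Wg for X, Wg in zip(Xc, W)])
    # Z.T @ Z equals D^{-1/2} C D^{-1/2}, a symmetric eigenproblem.
    mu, V = np.linalg.eigh(Z.T @ Z)
    order = np.argsort(mu)[::-1][:d]
    rho = (mu[order] - 1.0) / (K - 1)
    V = V[:, order] * np.sqrt(K)                  # average squared norm = 1
    splits = np.cumsum([X.shape[1] for X in Xc])[:-1]
    U = [Wg @ Vg for Wg, Vg in zip(W, np.split(V, splits, axis=0))]
    return U, rho
```

For $K = 2$ this reduces to ordinary CCA, since the eigenvalues of the whitened matrix are then one plus or minus the canonical correlations.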

3.3. Classification 

Given the measurements of $m$ new data points $y_{ik}$, $i = 1, \ldots, m$, $k = 1, \ldots, K$, (generalized) canonical correlation analysis in Section 3.2 yields the embeddings $\tilde{y}_{ik}$ in the common space $\mathbb{R}^d$. To classify $\tilde{y}_{ik}$, instead of using data points from the same space (i.e. $\tilde{y}_{i'k}$, $i' \neq i$), we consider the problem in which we must borrow the embeddings from another space $\Xi_{k'}$ for training, that is, $\tilde{y}_{i'k'}$, $i' \neq i$, $k' \neq k$. This problem is motivated by the fact that in many situations there is a lack of training data in the space where the testing data lie.
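The cross-space classification described here can be sketched in plain Python as a leave-one-out $k$-nearest-neighbor loop: each test embedding from one space is classified using the embeddings of all other objects from the other space. The function name and argument layout are hypothetical, and ties in the vote are broken by nearest-first order, which is an assumption.

```python
import math
from collections import Counter


def cross_space_loo_accuracy(test_emb, train_emb, labels, k=5):
    """Leave-one-out k-NN accuracy with test and training data from two spaces.

    test_emb[i] and train_emb[i] are embeddings of the same object i in the
    common space, obtained from two different source spaces.
    """
    m = len(test_emb)
    correct = 0
    for i in range(m):
        # Candidates are all other objects, represented in the other space.
        candidates = [(math.dist(test_emb[i], train_emb[j]), labels[j])
                      for j in range(m) if j != i]
        candidates.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in candidates[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / m
```

The returned value corresponds to the fraction of correctly classified test points over all $m$ leave-one-out trials.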

3.4. Efficiency Investigation

We investigate the effect of the number of domain relation learning training 
data observations on the classification performance. 



4. Experimental Results

4.1. Dataset

Our experiments apply canonical correlation analysis and its generalization to text document classification. The dataset is obtained from Wikipedia, an open-source multilingual web-based encyclopedia with around 19 million articles in more than 280 languages. Each document may have links pointing to other documents in the same language that explain certain terms in its content, as well as to documents in other languages on the same subject. Articles on the same subject in different languages are not necessarily exact translations of one another. They can be written by different people and their contents can differ significantly.

English articles within a 2-neighborhood of the English article "Algebraic Geometry" are collected. The corresponding French documents are also collected. This data set can therefore be viewed as a two-space case: $\Xi_1$ is the English space and $\Xi_2$ is the French space. There are in total 1382 documents in each space, that is, $z_{1,1}, \ldots, z_{1382,1} \in \Xi_1$ and $z_{1,2}, \ldots, z_{1382,2} \in \Xi_2$. Note that $z_{ik}$, $i = 1, \ldots, 1382$, $k = 1, 2$, includes both the domain relation learning training data $x_{ik}$, $i = 1, \ldots, n$, and the new data points $y_{ik}$, $i = 1, \ldots, m$ ($m + n = 1382$), used for classification training and testing.

All 1382 documents are manually labeled into 5 disjoint classes (0-4) based on their topics. The topics are category, people, locations, date and math things, respectively. There are 119 documents in class 0, 372 documents in class 1, 270 documents in class 2, 191 documents in class 3, and 430 documents in class 4. The documents in classes 0, 2, 4 are the domain relation learning training data $x_{ik}$, $i = 1, \ldots, n$, $k = 1, 2$. There are in total 819 documents in those 3 classes ($n = 819$). The 563 ($m = 563$) documents in classes 1, 3 are the new data $y_{ik}$, $i = 1, \ldots, m$, $k = 1, 2$. They are used to train a classifier and run the classification test.
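The split sizes quoted above follow from the class counts. A short Python check, using only the numbers stated in the text (variable names are illustrative):

```python
# Class sizes as stated in the text (class label -> number of documents).
class_sizes = {0: 119, 1: 372, 2: 270, 3: 191, 4: 430}

relation_classes = (0, 2, 4)    # domain relation learning training data
classifier_classes = (1, 3)     # classifier training and testing data

n = sum(class_sizes[c] for c in relation_classes)
m = sum(class_sizes[c] for c in classifier_classes)
total = sum(class_sizes.values())
```

Here `n`, `m` and `total` evaluate to 819, 563 and 1382, matching $n$, $m$ and $m + n$.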

4.2. Dissimilarity Matrix 

The method described in Section 3.2 starts with the dissimilarity matrix. For our work two different kinds of dissimilarity measures are considered: the text content dissimilarity matrix $\Delta^t_k$ and the graph topology dissimilarity matrix $\Delta^g_k$. Both matrices are of dimension $1382 \times 1382$, containing the dissimilarity information for all data points $z_{1k}, \ldots, z_{1382,k}$.

Graphs $G_k(V, E_k)$ can be constructed to describe the dataset; $V$ represents the set of vertices, which are the 1382 Wikipedia documents, and $E_k$ is the set of edges connecting those documents in language $k$.

The entry $\Delta^g_k(i,j)$ of $\Delta^g_k$ is the number of steps on the shortest path from document $i$ to document $j$ in $G_k$. In the English space $\Xi_1$, $\Delta^g_1(i,j) \in \{0, \ldots, 4\}$, where the 4 comes from the 2-neighborhood document collection. In the French space $\Xi_2$, $z_{i2} \in \Xi_2$ is the document in French corresponding to the document $z_{i1} \in \Xi_1$, and $\Delta^g_2(i,j)$ depends on the French graph connections. It is possible that $\Delta^g_2(i,j) \neq \Delta^g_1(i,j)$. At the extreme end, $\Delta^g_2(i,j) = \infty$ when $z_{i2}$ and $z_{j2}$ are not connected. We set $\Delta^g_2(i,j) = 6$ for $\Delta^g_2(i,j) > 4$.
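The graph dissimilarity with its cap can be sketched with a breadth-first search. The Python example below assumes an undirected link graph given as an adjacency dictionary in which every vertex appears as a key (the text does not state whether link direction is used); path lengths above 4, including unreachable pairs, are replaced by 6 as described. The function name is hypothetical.

```python
from collections import deque


def graph_dissimilarity(adj, max_steps=4, far_value=6):
    """Shortest-path step counts, with anything beyond max_steps set to far_value.

    adj maps each vertex to an iterable of its neighbours (assumed undirected).
    Rows and columns follow the sorted vertex order.
    """
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    delta = [[far_value] * len(nodes) for _ in nodes]
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            if dist[v] == max_steps:
                continue  # do not expand beyond the cap
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for v, steps in dist.items():
            delta[index[source]][index[v]] = steps
    return delta
```

On a path graph a-b-c-d-e-f with an isolated vertex g, the entry for (a, e) is 4, while (a, f) and (a, g) are both set to 6.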

The entry $\Delta^t_k(i,j)$ of $\Delta^t_k$ is based on the text processing features for documents $z_{ik}, z_{jk} \in \Xi_k$. Given the feature vectors $f_{ik}, f_{jk}$, $\Delta^t_k(i,j)$ is calculated by the cosine dissimilarity $\Delta^t_k(i,j) = 1 - \frac{f_{ik} \cdot f_{jk}}{\|f_{ik}\|_2 \|f_{jk}\|_2}$. For our experiments, we consider three different features for $f$: mutual information (MI) features [T51 [501 HI], term frequency-inverse document frequency (TFIDF) features [25] and latent semantic indexing (LSI) features 0]. The Wikipedia dataset used in the experiments is available online at http://www.cis.jhu.edu/~zma/zmisi09.html. See the paper [52] for more details.
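For reference, the cosine dissimilarity between two feature vectors can be computed as in the following plain-Python sketch. It assumes dense, non-zero feature vectors of equal length; the function names are hypothetical.

```python
import math


def cosine_dissimilarity(f_i, f_j):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    return 1.0 - dot / (norm_i * norm_j)


def text_dissimilarity_matrix(features):
    """Pairwise cosine dissimilarities for a list of feature vectors."""
    return [[cosine_dissimilarity(f, g) for g in features] for f in features]
```

Orthogonal vectors give a dissimilarity of 1 and parallel vectors give 0 up to rounding, so for non-negative features such as TFIDF the entries lie in $[0, 1]$.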
[Figure 3 panels include "English graph" and "French text"; x-axis: eigenvalue index]

Figure 3: Square root of eigenvalues for covariance matrix (all data used)



4.3. Embedding Dimension Selection

To choose the dimension $d$ for the common space $\mathbb{R}^d$, we pick a sufficiently large dimension and embed the graph and text dissimilarity matrices via multidimensional scaling. The scree plot for the MDS embedding is shown in Figure 3 (term frequency-inverse document frequency features are used for the text dissimilarity calculation).

Based on the plots in Figure 3, we choose $d = 15$ for the dimension of the joint space $\mathcal{X}$, which is low but preserves most of the variance. This model selection choice of dimension is an important issue in its own right; for this paper, we fix $d = 15$ throughout.

The canonical correlation analysis step first requires embedding the dissimilarity matrices into $d'$ dimensions via multidimensional scaling, as described in Section 3.2. When we choose a different number $n'$ of domain relation learning training documents, $d'$ depends on $n'$. The choice of dimension is once again an important model selection problem; for this paper, the values of $d'$ for different $n'$ are shown in Table 1. We believe that the values of $d'$ are chosen large enough to preserve most of the structure yet still small enough to avoid dimensions of pure noise, which might deteriorate the following (G)CCA step. The second column indicates what percentage of the total manifold matching training data is used.
Table 1: MDS Dimensions

n'     % of n    d'
82     10%       40
164    20%       80
246    30%       100
328    40%       100
410    50%       150
491    60%       150
573    70%       150
655    80%       200
737    90%       200
819    100%      200



4.4. Classification Performance

The classifier used in the experiment is $k$-nearest neighbor ($k$-NN). The class label of a test data point is assigned by the majority class label of the $k$ closest training data points. The distance used is the usual Euclidean distance. For our experiments we use the 5-nearest neighbor classifier. (We do not claim that 5-NN is optimal for our experimental data. Rather, it is illustrative; the goal of our experiments is to demonstrate the utility of using disparate domain relation learning training documents via GCCA.)

There are 563 new data points $y_{ik}$ in classes 1 and 3. Class 1 has 372 data points, and the remaining 191 have class label 3. For each $n'$ in Table 1 we randomly sample $n'$ out of the total 819 domain relation learning training documents to learn the common space $\mathbb{R}^d$ into which we project the new data points. The classification is run in a leave-one-out way. We use 200 Monte Carlo replicates to calculate the average performance.
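The subsampling scheme can be sketched as follows. The subset sizes are 10% to 100% of the 819 training documents, rounded to the nearest integer, and each Monte Carlo replicate draws one random subset. The sketch assumes sampling without replacement; the seed and the function names are hypothetical.

```python
import random

N_TOTAL = 819        # domain relation learning training documents
N_REPLICATES = 200   # Monte Carlo replicates per subset size


def subsample_sizes(n_total=N_TOTAL):
    """Subset sizes n' for S = 10%, 20%, ..., 100%, rounded to nearest integer."""
    return [(pct * n_total + 50) // 100 for pct in range(10, 101, 10)]


def monte_carlo_subsets(n_total, n_sub, replicates=N_REPLICATES, seed=0):
    """Yield one sorted random index subset of size n_sub per replicate."""
    rng = random.Random(seed)
    for _ in range(replicates):
        yield sorted(rng.sample(range(n_total), n_sub))
```

Each yielded index set would select the rows used to learn the common space in one replicate, and the reported accuracy is the average over the 200 replicates.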

The method described in Section 3.2 generates the embeddings $\tilde{y}_{ik} \in \mathbb{R}^{15}$, $i = 1, \ldots, 563$, $k = 1, 2$. Because there are two kinds of dissimilarity matrices considered, we have $\Delta^t_k \mapsto \tilde{y}^t_{ik}$ and $\Delta^g_k \mapsto \tilde{y}^g_{ik}$. The training and testing data can be chosen not only from different spaces (i.e. English space and French space), but also from different dissimilarity measures (i.e. text content dissimilarity and graph topology dissimilarity). Classification results are shown in Figures 4a, 4b and 4c. Note that we use different text document processing features to calculate $\Delta^t$. Figures 4a, 4b and 4c are based on the latent semantic indexing, term frequency-inverse document frequency and mutual information features respectively.

For all three figures, the x-axis label $S$ indicates what proportion of the total $n$ data points is used for domain relation learning training, that is, $S = n'/n$; the y-axis is classification accuracy.
To get the solid circle curve, $\Delta^g_2$ is used for training and $\Delta^g_1$ for testing; thus $x^g_{ik}$, $i = 1, \ldots, n'$, $k = 1, 2$, are employed to learn the manifold matching. For each test data point $\tilde{y}^g_{i1}$, $i \in \{1, \ldots, m\}$, the 5-NN classifier is trained on $\tilde{y}^g_{i'2}$, $i' = 1, \ldots, i-1, i+1, \ldots, m$, and the classification accuracy is calculated as $m'/m$, where $m'$ is the number of correctly classified testing data points. For each $n'$, 200 Monte Carlo replicates are run to randomly sample $n'$ out of the total $n$ domain relation learning training data points $x^g_{ik}$, $i = 1, \ldots, n$. The average accuracy is plotted; standard errors are available via bootstrap resampling.

The dashed triangle curve is similar to the solid circle curve except that the training data come from $\Delta^t_2$ instead of $\Delta^g_2$. Since $\Delta^t_2$ and $\Delta^g_1$ lie within different ranges, prescaling is needed, which is done via $\Delta^t_2 = \Delta^t_2 \, \frac{\|\Delta^g_1\|}{\|\Delta^t_2\|}$.
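A minimal sketch of such a prescaling step is given below. The choice of matrix norm is an assumption here (the Frobenius norm is used), and the function names are hypothetical; the text dissimilarities are multiplied by the ratio of the graph matrix norm to the text matrix norm so that both matrices end up with the same norm.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries (assumed norm)."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def prescale(delta_text, delta_graph):
    """Rescale delta_text so that its norm matches that of delta_graph."""
    scale = frobenius_norm(delta_graph) / frobenius_norm(delta_text)
    return [[v * scale for v in row] for row in delta_text]
```

After rescaling, cosine dissimilarities in $[0, 1]$ and path lengths in $\{0, \ldots, 6\}$ are on a comparable scale before the joint embedding.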

The remaining three curves (dotted plus, dotdash diamond, longdash asterisk) show the results of generalized CCA, which embeds $\Delta^g_1$, $\Delta^g_2$ and $\Delta^t_2$ simultaneously to get $\tilde{y}^{g*}_{i1}$, $\tilde{y}^{g*}_{i2}$ and $\tilde{y}^{t*}_{i2}$, $i = 1, \ldots, 563$ (with prescaling of $\Delta^t_2$ via $\Delta^t_2 = \Delta^t_2 \, \frac{\|\Delta^g_1\|}{\|\Delta^t_2\|}$). For all three curves, $\tilde{y}^{g*}_{i1}$ is the testing data. For the dotted plus curve, the 5-NN classifier is trained on $\tilde{y}^{g*}_{i2}$. For the dotdash diamond curve, the training data is $\tilde{y}^{t*}_{i2}$. The longdash asterisk curve shows the classification performance when training on $(\tilde{y}^{g*}_{i2} + \tilde{y}^{t*}_{i2})/2$.

Based on the results shown in Figures 4a, 4b and 4c, when canonical correlation analysis is used to embed the pair $(\Delta^g_1, \Delta^g_2)$ or $(\Delta^g_1, \Delta^t_2)$ in the same low-dimensional space $\mathbb{R}^d$, $\Delta^t_2$ outperforms $\Delta^g_2$ in terms of classifying $\Delta^g_1$ for TFIDF and MI text features. If we instead use generalized canonical correlation analysis to map $\Delta^g_1$, $\Delta^g_2$ and $\Delta^t_2$ to $\mathbb{R}^d$ simultaneously, the embedding improves in terms of classification performance. That is, to classify the embeddings of $\Delta^g_1$, the 5-NN classifier trained on $\tilde{y}^{g*}_{i2}$ and tested on $\tilde{y}^{g*}_{i1}$ (dotted plus curve) outperforms the one trained on $\tilde{y}^g_{i2}$ and tested on $\tilde{y}^g_{i1}$ (solid circle curve), and a similar result holds for the pair $\Delta^g_1$ and $\Delta^t_2$ (dotdash diamond and dashed triangle curves). This indicates that incorporating information from an additional domain improves upon the embedding obtained via canonical correlation analysis for the classification task. The best classification results (longdash asterisk curves) come from not only using generalized canonical correlation analysis for domain relation learning training, but also using both $\tilde{y}^{g*}_{i2}$ and $\tilde{y}^{t*}_{i2}$ for classification training.

Instead of using the MDS dimensions given in Table 1, we also consider a lower dimension $d''$ for each considered $n'$. By reducing the MDS dimensions, we impose additional regularization. Thus we refer to this as Regularized CCA and GCCA. How to choose the values of $d''$ properly is a non-trivial model selection problem. The values of $d''$ determine the regularization level. We choose $d''$ to be smaller than $d'$ to remove noisy dimensions from the MDS embedding, but not so small that the fidelity of MDS is lost. We use $d'' = d'/2$. The classification
results are shown in Figures 4d, 4e and 4f, and they are better than the non-regularized CCA and GCCA results in Figures 4a, 4b and 4c, which is consistent with our expectation. The improvement of regularized CCA and regularized GCCA over their non-regularized counterparts comes from the removal of the noisy dimensions in the MDS embedding.

Table 2: Classification Accuracy

                               Non-regularized                    Regularized
Method             Feature     S = 10%          S = 100%          S = 10%          S = 100%
                               d' = 40          d' = 200          d'' = 20         d'' = 100
CCA (GF -> GE)     LSI         61.24% ± 0.12%   63.50% ± 0.10%    66.67% ± 0.13%   71.84% ± 0.10%
                   TFIDF       61.73% ± 0.12%   63.48% ± 0.10%    66.69% ± 0.13%   71.85% ± 0.10%
                   MI          61.88% ± 0.12%   63.51% ± 0.10%    66.47% ± 0.13%   71.85% ± 0.10%
CCA (TF -> GE)     LSI         64.75% ± 0.11%   67.05% ± 0.18%    68.51% ± 0.14%   76.20% ± 0.11%
                   TFIDF       65.64% ± 0.12%   75.13% ± 0.10%    68.43% ± 0.15%   77.09% ± 0.11%
                   MI          67.14% ± 0.09%   71.05% ± 0.09%    71.03% ± 0.12%   76.91% ± 0.11%
GCCA (GF -> GE)    LSI         65.30% ± 0.14%   74.42% ± 0.10%    66.91% ± 0.14%   74.42% ± 0.08%
                   TFIDF       65.57% ± 0.15%   70.70% ± 0.11%    66.84% ± 0.14%   72.47% ± 0.09%
                   MI          66.30% ± 0.14%   71.40% ± 0.10%    66.80% ± 0.14%   74.60% ± 0.08%
GCCA (TF -> GE)    LSI         69.21% ± 0.12%   74.07% ± 0.13%    69.77% ± 0.13%   78.51% ± 0.07%
                   TFIDF       69.33% ± 0.13%   75.31% ± 0.10%    69.41% ± 0.15%   77.09% ± 0.09%
                   MI          70.63% ± 0.12%   78.15% ± 0.08%    72.24% ± 0.12%   79.04% ± 0.06%
GCCA (GTF -> GE)   LSI         71.31% ± 0.11%   77.26% ± 0.11%    71.02% ± 0.12%   83.21% ± 0.06%
                   TFIDF       70.53% ± 0.11%   79.93% ± 0.10%    69.77% ± 0.13%   81.61% ± 0.08%
                   MI          70.66% ± 0.11%   77.26% ± 0.09%    70.23% ± 0.12%   80.82% ± 0.08%

Table 2 shows the classification accuracy of various methods for $S = 10\%$ and $S = 100\%$.

In the experimental settings described above, the documents in classes 1 and 3 are used for classifier training and testing, while the documents in the remaining three classes (0, 2, 4) are the domain relation learning training data. Experimental results in Figure 4 and Table 2 show that GCCA is superior to CCA. However, it remains questionable whether this phenomenon holds in other settings. We investigate this question by choosing different class combinations for classifier training and testing. In addition to the choice of classes 1 and 3 used above, we also consider the other possible combinations of two classes. Regularized GCCA is considered here because it yields the best classification performance in the previous experimental settings. Given two classes for classifier training and testing, we use all domain relation learning training data available, that is, all the documents in the remaining three classes ($S = 100\%$). The embedding dimension for MDS is $d'' = 100$ as specified earlier in Table 2. Regarding the text feature, latent semantic indexing is selected. The results are shown in Table 3, where each row corresponds to one pair of classes. For example, the first row in Table 3 is for the case where classes 0 and 1 are used for classifier training and testing, and all documents in classes 2, 3, 4 are the domain relation learning training data. The results in Table 3 indicate that GCCA performs better than CCA for different choices of class combinations, thus strongly supporting the conclusion that GCCA is superior to CCA in terms of classification accuracy.

Table 3: Classification Accuracy for All Class Combinations

Regularized (LSI text feature), S = 100%, d'' = 100

Classes   CCA (GF -> GE)   CCA (TF -> GE)   GCCA (GF -> GE)   GCCA (TF -> GE)   GCCA (GTF -> GE)
0, 1      75.36% ± 0.04%   67.82% ± 0.11%   80.24% ± 0.03%    73.93% ± 0.03%    77.39% ± 0.04%
0, 2      74.29% ± 0.06%   66.58% ± 0.11%   83.03% ± 0.05%    75.84% ± 0.04%    86.89% ± 0.05%
0, 3      80.00% ± 0.08%   71.94% ± 0.17%   85.48% ± 0.05%    87.42% ± 0.07%    95.81% ± 0.04%
0, 4      76.14% ± 0.05%   67.40% ± 0.07%   78.51% ± 0.04%    75.41% ± 0.03%    77.41% ± 0.04%
1, 2      59.19% ± 0.07%   58.10% ± 0.09%   61.99% ± 0.07%    63.71% ± 0.07%    66.98% ± 0.06%
1, 3      71.84% ± 0.10%   76.20% ± 0.11%   74.42% ± 0.08%    78.51% ± 0.07%    83.21% ± 0.06%
1, 4      55.74% ± 0.06%   53.12% ± 0.07%   61.60% ± 0.06%    57.11% ± 0.08%    65.84% ± 0.06%
2, 3      59.22% ± 0.12%   67.46% ± 0.12%   64.64% ± 0.11%    67.25% ± 0.09%    69.85% ± 0.09%
2, 4      65.71% ± 0.07%   64.29% ± 0.07%   71.43% ± 0.05%    69.43% ± 0.05%    73.00% ± 0.04%
3, 4      73.11% ± 0.08%   73.91% ± 0.09%   76.81% ± 0.05%    76.97% ± 0.07%    82.13% ± 0.05%

Inferences regarding differences in the relative performance between compet- 
ing methodologies (as well as the seemingly non-monotonic performance across 
S for a given methodology) are clouded by the variability inherent in our per- 
formance estimates. However, these real-data experimental results nonetheless 
illustrate the general relative performance characteristics of CCA and GCCA 
and their regularized versions, as a function of S. 

5. Conclusion 

Canonical correlation analysis and its generalization are discussed in this paper as a manifold matching method. They can be viewed as reduced rank regression, and they are applied to a classification task on Wikipedia documents. We show their performance with manifold matching training data from different domains and different dissimilarity measures, and we also investigate their efficiency by choosing different amounts of manifold matching training data. The experimental results indicate that generalized canonical correlation analysis, which fuses data from disparate sources, improves the quality of manifold matching with regard to text document classification. Also, if we use regularized canonical correlation analysis and its generalization, we further improve performance.

Finally, increasing the amount of domain relation learning training data from 10% to 100% of the available 819 documents ($S$ in Figures 4a, 4b, 4c, 4d, 4e and 4f) yields approximately 10% improvement in classification performance. This improvement is independent of the amount of training data available for the classifier.
Figure 4: Classification accuracy with different amounts of domain relation learning training data for (G)CCA and regularized (G)CCA with LSI, TFIDF, and MI text features